For the EDA project, the data set is the Prosper Loan Data. This data set contains 113,937 loans with 81 variables on each loan, including loan amount, borrower rate (or interest rate), current loan status, borrower income, borrower employment status, borrower credit history, and the latest payment information. The proper variable definations for the data can be found here.
Let’s start this exploration with loading the data and saving the data into a dataframe to be inspected.
## [1] 113937
Considering that this data has 81 columns and 113937 rows, we can easily choose a chunk of this data to do analysis on. I shall start with dividing the data set into 2 considering the scenario pre-2009 and post-2009. This is also how the data is divided as columns such as the CreditGrade are only applicable for loans pre-July 2009. Thus, this data obviously tells a story of the pre-and-post 2009 scenarios and that is what I shall try to explore in this analysis.
Giving a little background on my decision to divide the data is important. From 2007 to mid 2009, the Great Financial Crisis occured followed by the Great Recession. The precipitating factor for the this Crisis was a high default rate in the United States subprime home mortgage sector i.e home loans. Critics argued that credit rating agencies and investors failed to accurately price the risk involved with mortgage-related financial products, and that governments did not adjust their regulatory practices to address 21st-century financial markets.
Moreover, accouding to Wikipedia, on November 24, 2008, the SEC found Prosper to be in violation of the Securities Act of 1933. As a result of these findings, the SEC imposed a cease and desist order on Prosper. In July 2009, Prosper reopened their website for lending (“investing”) and borrowing after having obtained SEC registration for its loans (“notes”). After the relaunch, bidding on loans was restricted to residents of 28 U.S. states and the District of Columbia. Borrowers may reside in any of 47 states, with residents of three states (Iowa, Maine, and North Dakota) not permitted to borrow through Prosper. This also highlights the divide in the data.
Thus, I will divide the data set into the pre and post financial crisis data set to get an accurate depection of the differences, specially, the credit grade and the numerous columns such as prosper score found only after 2009.
Let’s check the two dataframes
PRE JUNE 2009
## [1] 28973
## ListingNumber CreditGrade
## 0 20
## Term LoanStatus
## 0 0
## ClosedDate BorrowerAPR
## 0 25
## BorrowerRate LenderYield
## 0 0
## EstimatedEffectiveYield EstimatedLoss
## 28957 28957
## EstimatedReturn ProsperRating..numeric.
## 28957 28957
## ProsperRating..Alpha. ProsperScore
## 28957 28957
## ListingCategory..numeric. BorrowerState
## 0 5515
## Occupation EmploymentStatus
## 2255 2255
## EmploymentStatusDuration IsBorrowerHomeowner
## 7606 0
## CurrentlyInGroup GroupKey
## 0 17678
## CreditScoreRangeLower CreditScoreRangeUpper
## 591 591
## FirstRecordedCreditLine CurrentCreditLines
## 697 7604
## OpenCreditLines TotalCreditLinespast7years
## 7604 697
## OpenRevolvingAccounts OpenRevolvingMonthlyPayment
## 0 0
## InquiriesLast6Months TotalInquiries
## 697 1159
## CurrentDelinquencies AmountDelinquent
## 697 7622
## DelinquenciesLast7Years PublicRecordsLast10Years
## 990 697
## PublicRecordsLast12Months RevolvingCreditBalance
## 7604 7604
## BankcardUtilization AvailableBankcardCredit
## 7604 7544
## TotalTrades TradesNeverDelinquent..percentage.
## 7544 7544
## TradesOpenedLast6Months DebtToIncomeRatio
## 7544 1248
## IncomeRange IncomeVerifiable
## 0 0
## StatedMonthlyIncome TotalProsperLoans
## 0 26730
## TotalProsperPaymentsBilled OnTimeProsperPayments
## 26730 26730
## ProsperPaymentsLessThanOneMonthLate ProsperPaymentsOneMonthPlusLate
## 26730 26730
## ProsperPrincipalBorrowed ProsperPrincipalOutstanding
## 26730 26730
## ScorexChangeAtTimeOfListing LoanCurrentDaysDelinquent
## 26731 0
## LoanFirstDefaultedCycleNumber LoanMonthsSinceOrigination
## 18272 0
## LoanNumber LoanOriginalAmount
## 0 0
## LoanOriginationDate LoanOriginationQuarter
## 0 0
## MonthlyLoanPayment LP_CustomerPayments
## 0 0
## LP_CustomerPrincipalPayments LP_InterestandFees
## 0 0
## LP_ServiceFees LP_CollectionFees
## 0 0
## LP_GrossPrincipalLoss LP_NetPrincipalLoss
## 0 0
## LP_NonPrincipalRecoverypayments PercentFunded
## 0 0
## Recommendations InvestmentFromFriendsCount
## 0 0
## InvestmentFromFriendsAmount Investors
## 0 0
## Date
## 0
** POST 2009 DF **
## [1] 84964
## ListingNumber CreditGrade
## 0 84964
## Term LoanStatus
## 0 0
## ClosedDate BorrowerAPR
## 58848 0
## BorrowerRate LenderYield
## 0 0
## EstimatedEffectiveYield EstimatedLoss
## 127 127
## EstimatedReturn ProsperRating..numeric.
## 127 127
## ProsperRating..Alpha. ProsperScore
## 127 127
## ListingCategory..numeric. BorrowerState
## 0 0
## Occupation EmploymentStatus
## 1333 0
## EmploymentStatusDuration IsBorrowerHomeowner
## 19 0
## CurrentlyInGroup GroupKey
## 0 82918
## CreditScoreRangeLower CreditScoreRangeUpper
## 0 0
## FirstRecordedCreditLine CurrentCreditLines
## 0 0
## OpenCreditLines TotalCreditLinespast7years
## 0 0
## OpenRevolvingAccounts OpenRevolvingMonthlyPayment
## 0 0
## InquiriesLast6Months TotalInquiries
## 0 0
## CurrentDelinquencies AmountDelinquent
## 0 0
## DelinquenciesLast7Years PublicRecordsLast10Years
## 0 0
## PublicRecordsLast12Months RevolvingCreditBalance
## 0 0
## BankcardUtilization AvailableBankcardCredit
## 0 0
## TotalTrades TradesNeverDelinquent..percentage.
## 0 0
## TradesOpenedLast6Months DebtToIncomeRatio
## 0 7306
## IncomeRange IncomeVerifiable
## 0 0
## StatedMonthlyIncome TotalProsperLoans
## 0 65122
## TotalProsperPaymentsBilled OnTimeProsperPayments
## 65122 65122
## ProsperPaymentsLessThanOneMonthLate ProsperPaymentsOneMonthPlusLate
## 65122 65122
## ProsperPrincipalBorrowed ProsperPrincipalOutstanding
## 65122 65122
## ScorexChangeAtTimeOfListing LoanCurrentDaysDelinquent
## 68278 0
## LoanFirstDefaultedCycleNumber LoanMonthsSinceOrigination
## 78713 0
## LoanNumber LoanOriginalAmount
## 0 0
## LoanOriginationDate LoanOriginationQuarter
## 0 0
## MonthlyLoanPayment LP_CustomerPayments
## 0 0
## LP_CustomerPrincipalPayments LP_InterestandFees
## 0 0
## LP_ServiceFees LP_CollectionFees
## 0 0
## LP_GrossPrincipalLoss LP_NetPrincipalLoss
## 0 0
## LP_NonPrincipalRecoverypayments PercentFunded
## 0 0
## Recommendations InvestmentFromFriendsCount
## 0 0
## InvestmentFromFriendsAmount Investors
## 0 0
## Date
## 0
So there is an obvious difference in the loan data pre-2009 and post-2009. For starters, the number of rows in each set is quite different. I shall have to make use of ratios and percentages to do my analysis.
I want to see if there was a difference pre or post 2009 for which the loans were taken. Considering this is a discrete variable a bar chart would be a good choice for this type of analysis.
##
## 0 1 2 3 4 5 6 7
## 16945 5071 623 1874 2395 476 329 1260
##
## 0 1 2 3 5 6 7 8 9 10 11 12
## 20 53237 6810 5315 280 2243 9234 199 85 91 217 59
## 13 14 15 16 17 18 19 20
## 1996 876 1522 304 52 885 768 771
Oddly enough, individuals are taking loas for various different reasons after the crisis. This seems to be inline with the article here which highlights that since the recession is over and people have their jobs back, there seems to a trust in the industry again. So we see quite a few differences between the pre-2009 loan categories and the post-2009 loan categories.
Debt consolidation is where someone obtains a new loan to pay out a number of smaller loans, debts, or bills that they are currently making payments on. In doing this they effectively bring all these debts together into one combined loan with one monthly payment. Since this is bringing multiple debts together and combining them into one loan, this is referred to as “consolidating” them. That’s why it’s called a debt consolidation loan.
Let’s look at some of the demographics of the people who have taken out debt consolidation loans.
There is a very little change in the trend of occupations of borrowers before and after the crisis. Professionals ask for the maximum percentage of loans. A thing to note however, of the top 10, 6 of the professions are high earning professions. Now lets look at it further and see that from the whole batch, how many have taken out debt consolidation loans?
This is where we see the change in the trends of Loans before and after the crisis. * Before the crisis, a very small percentage of the loans were debt consolidation loans, with professionals have about 5% of all loans be debt consolidation loans. * After the the crisis, we see a majority of the loans are debt consolidation loans in every occupation. Taking the professional category only, there is an increase from 5% to 21% showing how most of the loans after crisis were debt consolidation loans. From the categories graph we see that almost 60% of the total loans post crisis were Debt Consolidation loans.
Heat Map plotting has been leanrt from here
## [1] "colorado" "georgia" "minnesota" "new mexico" "kansas"
## [6] "california"
The highest mean amount borrowed is by one of the smallest states i.e New Jersy while the rest of the states seem to have similar mean amounts borrowed
By the number of loans asked for, the California seems to have the highest number. New Jersey, with its high mean amount seems to be down the ladder with its low number of loan counts.
We see that many states have higher mean loan amounts as compared to before and even each other because there are many more lighter blues than before.
However, the number of loans by state seem to follow a similar patter as before with California leading the way in the number of loans after the crisis even though the mean loan amount was less.
Common legend code from here
Income range was given in a range formats, hence I chose the stated monthly income and multiplied it by 12 to get yearly income, which is much easier to play aroud with. I compared the yearly income with the borrower rate to see if there’s any connection between the two. I see that it is all over the place i.e there is no connection present, so I added the IsBorrowerHomeOwner field to check if there’s any relation. I see that yearly income is a major factor in determining if the borrower is a homeowner or not. For both before and after the crisis, an income of less than 50000 per annum gives a very low probability of having a home. Borrower Rate vs income is all over the place though. Before the crisis, the borrower rate went to 0.5 but after the crisis the borrower rate is capped at 0.3 no matter what the income.
Since the credit Grade and the Prosper Rating are the same, only categorized before and after the crisis, I decided to see how the ratings are affected by a combination of loan amount vs income. While the scatter plot is scttered, there is a marked difference between the pre and post crisis plots. * Higher Loan amounts to for the same income range after the crisis which can indicate that people are asking for more loans than before. * A marked change in Prosper ratings/ Credit Ratings on the $4000, $5000, $10000 and $20000 marks on the Y axis after the crisis while before the crisis, there isn’t a marked change seen. Infact, almost all of the blue and purple points are below the $5000 mark which can highlight a more cautious way of giving a rating to the borrower.
Now we can clearly see what the Prosper rating after the crisis was dependant on i.e Estimated Loss. The more the loss that was estimated, the worse your Prosper Rating would be. Plots 1 and 3 are important in showing this relationship, where the color changes with the change of the Estimated Loss. This highlights that the rating was based on the Estimated Loss of that could be incurred from the borrower.
I also checked the relationship with the Estimated Return column, but it is clearly seen that the there is no connection to the Rating with the return with the coloured dots all over the graph.
Another highly interesting graph to see is the plot of EstimatedEffectiveYeild vs the Estimated Return Plot. Once again we see that there is a proper change in the colour i.e the Prosper Rating showing how the Rating is affected by the Yeild. An interesting observataion is that there is a solid straight line through the graph. This is when the Yeild equals to the return and when the Loss equals 0. The colors depend then on only the Effective Yeild and not the loss.
## [1] "50.4471769486646 % of borrowers are also homeowners"
## [1] "27.4142727998806 percent of the homeowners have also taken out debt consolidations loans"
## [1] "51.1756497011506 percent of the borrowers have taken out debt consolidations loans"
I did not think there was a need to make a visual to answer this question because of its obvious nature. I am looking at numbers and thus, I can see that loans for homeowners are there and many of the homeowners have more than one loan, thus there is a need to take out debt consolidation loans.
Let’s look at the actual loans and what they are affected by:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1000 2500 4500 6166 8000 25000
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1000 4000 7500 9077 13500 35000
Before the crisis, there loan amounts taken are less than after the crisis. While the base loan amount is the same for both, we see a higher range after the crisis, with a higher mean and median.
The simplist plot i.e the boxplot also highlights this aspect with the post crisis mean and median being higher than the pre- crisis mode and median.
It seems that with the higher loan amounts of the borrowers, their income range has also increased after the crisis with the outliers so high, that I could not see the actual boxplot. It is after plotting the boxplot without the outliers do I see the trend that the income after the crisis is higher. This can be associated with the higer cost of living after the crisis and with the increase of the basic income needed to live along with inflation. All in all $1 in 2008 → $1.17 in 2018 showing how the cost of living has increased with the increase in income. This can also give reason to the higher loan prices.
Apart from the lesser number of loans taken before the crisis, we see that a very small percentage of the loan payments were less than a month late and almost none of them were more than one month late before the crisis. After the crisis though, we see that a significant percentage of loan payments were late and many even more than one month late.
For this duration, the median is less than the mean. An interesting thing to note is that there is a massive drop from the last 2008 quater to the mid 2009. These are the years of the Great Financial Crisis and this graph proves that the Loan trends were affected significantly by the Great Crisis of 2008.
The aim for the final plots is to combine a 2-4 plots into one to form a summary of the trends I have tried to highlight during the analysis.
Knowing that the debt consolidation loans make about 51.17% of the total borrowers, it is important to know what the demographics of these borrowers are. Their occupations is what I have plotted. It is also important to segament the data into before and after the crisis to highlight the drastic change in loan taking. Where debt consolidation was a minor part of the total loan categories before the Crisis, it became the number 1 reason for taking loans after the Crisis.
As seen from the graph above, where the light blue color highlights are percentage of debt consolidation loans while the dark blue color highlights the other reasons for taking loans, it can be seen wht debt consolidation makes up about 51% of the data. Prior to the crisis, there was no reason to take debt consoldation loans due to people’s income level being at par with their loan debt. However, as mentioned before, a debt consolidation loan combines all other loans into one big loan with low interest so that an individual only has to pay one monthly payment. After the crisis it can be seen that the loan taking has completely switched, with the majority being debt consolidation loans and only a minority being loans for other purposes. It should be mentioned as I know from plotting the graph before, that after the crisis people started taking loans for luxury items as well such as a boat, motor cycle and vacations. In light of this information, it is still astonishing to see the proportion of debt consolidation loans over other loans.
Coming to the occupations of the people who are asking for the loans, it can be seen that loans are mostly given to people with an already steady and high income such as Analyst, or computer programmer or Executive. These people have the highest earning jobs and thus can proof their income and ability to pay back their loans. However, it is the professional category that has the maximum number of loans in pre-crisis and post crisis categories.
Given the numerous facors given for the Prosper rating, I wanted to know which is the most important and what does the Prosper rating depend on. The data is majority after the crisis data i.e after July 2009 so I had enough data to make a conclusion.
From the given plot, which is an amalgamation of four plots which plots four different variables against each other, I can easily see that Propser Rating depends solely on Estimated Loss. The greater your estimated loss ratio, the worse your rating is given to be. Estimated loss is then calculated from a variety of different elements such as demographics (occupation, employement status, investors) and income (monthly income, other loans, number of delinquencies and amount delinquent)
I can also safely say that Prosper rating is dependant on estimated loss due to the proper lines formed when plotted with Estimated Loss while there is no lines seen with the plots of Estimated Return. This is also highlighted with the 4th graph of Estimted return vs Estimated Effective Yeild. It is obvious the two variables are correlated. An increase in Estimated Return should also give you an increase in Estimated Effective Yeild. However, we see that the two are not lineraly related. There’s a squared correlation. This is because of the Estimated Loss variablt which highlights how the Estimated Loss variable is vital in calculating the Prosper Rating.
I wanted to check how state incomes are mapped with state loans and thus, I plotted the mean of the two giving me an indication of how the states handle debt.
This chart shows the recent debt per state in the US where California, Texas and NewYork have massive amounts of Debt.
The data given, only has Prosper oans till 2014 and thus, it is imporant to see the current world scenario. As mentioned before, the incomes and loan amounts have increased 10 fold after the crisis. In the chart, we see that the states with the highest average incomes are paying the least amounts of debt, while states such as California is which have low averge income are paying a high amount for their debt. This can show the start of the Dbet crisis that the world is not facing with more defaulters than on time paying individuals.
I was interested in analyzing this particular set of data because of my interest in world affairs and the debt crisis. The Debt Crisis was particularly interesting because it has its causes stuck in the 1980s. Bad governemtn policies and irresponisble debt taking trends along with a lack of awareness seems to be the key reasons for this debt crisis. In 2008, U.S. households lost an estimated 18% of their net worth, equaling approximately an 11.2 trillion loss. In 2011, the U.S. debt reached 100% of its GDP and now the fedral goverments total debt stands at an enourmous 21.97 trillion. It’s all connected.
Thus, I undertook the effort to try and figure out the consequences of the Financial Crisis of 2008-09. I found it was tougher than expected, because I was unfamiliar with the syntax, the function calls, and the overall behavior of R. Maybe if I had done this in Jupyter Notebooks, it would be better as I am fimilair with it.
I was pleased with how easy it is to produce a simple map of the world in R using the states library. And plotting scatter plots and lines was easy enough and straight forward. But I spent a lot of time looking for advanced plotting packages that should make my life easier. After wasting a lot of time, I realized I felt compelled to code it myself and get the job done correctly. These are the problems of learning a new language so quickly.
I got hung up multiple times my code started behaving oddly. For example when I was plotting a the world map and the figure was not darwing and only the background was showing. I had to ask my fellow Bertlesmann Scholars for help and some of them were just as stuck on it as I was. On the other hand, it was nice to be able run code in a function, and access the results from the console.
At the end of it all, R programming is not a whole lot easier than Java or C (which I have done a lot of coding in), except for really simple data structures. If asked for a preference, I’d chose Python any day. However, given enough time with R as I have had with Python, it would probably be hard to decide.
I feel like I would have been able to do this project better in Python as I am more fimiliar with the syntax having being using the language as my main for the past couple of years and with the looming deadline, there was only a finite amount of time I could have spent on this project and trying to learn R.